11. Hyperparameter Tuning in Review
Machine learning models use data to fit their internal parameters. However, all models also have settings that configure how they work and aren't modified during training; these are called hyperparameters. We can tune hyperparameters by building many different models, each with different hyperparameter values, and evaluating each model's performance. Just like we can overfit a model's parameters, we can also overfit its hyperparameters. To avoid this, we estimate performance using nested cross-validation.
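To make that distinction concrete, here is a minimal sketch (the data is a synthetic stand-in, not the lesson's activity dataset): hyperparameters are fixed when the model is constructed, while parameters are fit from the data:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data; the lesson's real features live in
# /activity-classifier/data/.
X = np.random.rand(100, 5)
y = np.random.randint(0, 2, size=100)

# Hyperparameters: chosen by us before training, passed to the constructor.
clf = RandomForestClassifier(n_estimators=50, max_depth=2)

# Parameters: the individual tree splits are learned from the data here.
clf.fit(X, y)
```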
Some hyperparameters control how easily the model can overfit the data, often by restricting its complexity. In our case, this hyperparameter was the depth of the trees in the forest. When we limited the tree depth to just 2, we saw the cross-validation error decrease substantially.
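A hedged sketch of comparing tree depths with cross-validation is below; because the data here is synthetic, the scores will not reproduce the lesson's numbers:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X = np.random.rand(200, 10)           # synthetic stand-in features
y = np.random.randint(0, 4, size=200) # synthetic stand-in activity labels

# Compare an unconstrained forest against one limited to depth 2.
for depth in (None, 2):
    clf = RandomForestClassifier(n_estimators=100, max_depth=depth, random_state=0)
    scores = cross_val_score(clf, X, y, cv=5)
    print(f"max_depth={depth}: mean CV accuracy = {scores.mean():.3f}")
```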
Another hyperparameter we can modify is the number of features we include in the model. Random forest models rely on some features more than others to classify the data. In sklearn, we can ask the RandomForestClassifier which features were most important. By building a new model that uses only the 10 best features, we were able to improve our performance to 93%.
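A minimal sketch of this feature-selection step, assuming synthetic data and a hypothetical pool of 30 candidate features:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

X = np.random.rand(200, 30)           # 30 hypothetical features
y = np.random.randint(0, 4, size=200)

clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X, y)

# feature_importances_ scores how much each feature contributed to the splits.
top10 = np.argsort(clf.feature_importances_)[::-1][:10]

# Retrain using only the 10 most important features.
clf_top = RandomForestClassifier(n_estimators=100, random_state=0)
clf_top.fit(X[:, top10], y)
```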
Q: When should you use nested CV?
SOLUTION:
None of the above
Notebook Review
If you want to interact with the notebook from the video, you can access it here in the repo at /activity-classifier/walkthroughs/hyperparameter-tuning/ or in the workspace below.
The dataset used throughout this lesson can be found at the top of the lesson directory at /activity-classifier/data/.
Code
If you need the code, it can be found on GitHub at https://github.com/udacity.
Further Resources
Nested cross-validation can be a tricky concept to wrap your head around. Here are explanations from several different authors. Maybe one of the following resources will explain it in a way that clicks for you:
- Our code implementing nested CV was fairly verbose so that you could see all of the steps. As with almost everything in ML, sklearn can do it for us as well, and you can learn more about nested CV in sklearn through the documentation. A minimal sketch is shown after this list.
- Is overfitting our hyperparameters really a problem in practice? Yes (or so says this 2010 paper).
- An explanation of the difference between hyperparameters and regular parameters in this article from Machine Learning Mastery.
- If you want to learn more about regularization, check out this article from Towards Data Science.
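For reference, here is a compact sketch of nested CV with sklearn; the data and the max_depth grid are hypothetical stand-ins, not the lesson's setup. The inner GridSearchCV tunes the hyperparameters, while the outer cross_val_score estimates performance on folds the tuning never saw:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, cross_val_score

X = np.random.rand(200, 10)           # synthetic stand-in data
y = np.random.randint(0, 4, size=200)

# Inner loop: GridSearchCV picks the best max_depth on each training split.
inner = GridSearchCV(
    RandomForestClassifier(n_estimators=100, random_state=0),
    param_grid={"max_depth": [2, 4, None]},
    cv=3,
)

# Outer loop: evaluates the tuned model on held-out folds, giving a
# performance estimate that isn't biased by the tuning itself.
outer_scores = cross_val_score(inner, X, y, cv=5)
print(f"Nested CV accuracy: {outer_scores.mean():.3f}")
```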
Glossary
- Hyperparameter: A parameter of the model that dictates how the model learns. It is not fit during the model's training process.
- Regularization: A technique to reduce overfitting of a model by discouraging complexity in the model.
- Nested cross-validation: A technique to determine model performance when hyperparameters are also optimized.
Exercise 3: A Quirk in the Dataset
Instructions
- Complete the Offline or Online instructions below.
- Read through the whole .ipynb.
- Complete all the code cells that contain ## Your Code Goes Here.
Offline
- In the repo, which you can access here at /activity-classifier/exercises/3-quirk-in-the-dataset/, you should find the following file: 3_quirk_in_the_dataset.ipynb
- The dataset used throughout this lesson can be found at the top of the lesson directory at /activity-classifier/data/.
- Open up the Python notebook and associated files in your desired editor.
Note: Instructions for how to set up your local environment can be found in Introduction to Wearable Data's Developer Workflow concept.
Online
- Go to the next concept; 3_quirk_in_the_dataset.ipynb should be open and the workspace should already contain the appropriate data folder.